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Abstract 

Random survival forests (RSF) are a powerful method for risk prediction of 
right-censored outcomes in biomedical research. RSF use the log-rank split 
criterion to form an ensemble of survival trees. The most common approach 
to evaluate the prediction accuracy of a RSF model is Harrell’s concordance 
index for survival data (‘C index’). Conceptually, this strategy implies that the 
split criterion in RSF is different from the evaluation criterion of interest. This 
discrepancy can be overcome by using Harrell’s C for both node splitting and 
evaluation. We compare the difference between the two split criteria analytically 
and in simulation studies with respect to the preference of more unbalanced 
splits, termed end-cut preference (ECP). Specifically, we show that the log-rank 
statistic has a stronger ECP compared to the C index. In simulation studies and 
with the help of two medical data sets we demonstrate that the accuracy of RSF 
predictions, as measured by Harrell’s C, can be improved if the log-rank statistic 
is replaced by the C index for node splitting. This is especially true in situations 
where the censoring rate or the fraction of informative continuous predictor 
variables is high. Conversely, log-rank splitting is preferable in noisy scenarios. 
Both C-based and log-rank splitting are implemented in the R package rcuiger. 
We recommend Harrell’s C as split criterion for use in smaller scale clinical 
studies and the log-rank split criterion for use in large-scale ‘omics’ studies. 
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Highlights 

• Harrell’s C is proposed as a split criterion in random survival forests. 

• Split points of continuous predictor variables differ substantially between 
Harrell’s C and log-rank splitting. 

• The log-rank statistic has a stronger end-cut preference than Harrell’s C. 

• Harrell’s C outperforms log-rank splitting in smaller scale studies. 

• Harrell’s C outperforms log-rank splitting if the censoring rate is high. 


1. Introduction 


Random forests are among the most powerful methods for risk prediction in 
the biomedical sciences. The basic idea of random forests is to fit an ensemble 
of classification and regression trees (CART) to bootstrap samples that are 
generated from a set of learning data ( Breiman] 2001). Ensemble predictions 


are obtained by averaging predictions from the individual trees (Kruppa et al. 


2014). An important element of random forests is that only a small number of 


the predictor variables is made available for splitting, which is done at random 
in each node of a tree. With this randomization element, trees are decorrelated, 
and the variance of the ensemble prediction is reduced. The random selection 
of predictors also constitutes the main difference between random forests and 
earlier tree-based ensemble methods, such as bootstrap aggregating (bagging; 


Breiman 1996). 


Random forests were originally proposed for classifying dichotomous out¬ 


comes (Breiman 2001) and have been extended over the past 15 years in a 


number of ways. For example, various methods have been developed for judg¬ 
ing the importance of predictor variables, which may serve as a basis for variable 


selection (Di'az-Uriarte and Alvarez de Andres 2006 Ishwaran et al. 2011). It is 


also possible to estimate individual probabilities for both dichotomous and cat- 


egorical outcomes (Kruppa et al. 

2013 

) and to analyze continuous outcomes as 

well as right-censored event times ( 

Ishwaran et al. 2008 

). Finally, considerable 


progress has been made in understanding the statistical properties of random 
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forests, including results on consistency and asymptotic normality (Biau 2012 


l^rlot and Genuer] |2014[ [Scornet et al.[ |2015[ [Wager and Walttiei] |2015| jWagerj 


and Athey 2015 Mentch and Hooker 2016). Reviews can be found elsewhere; 


see, e.g.. 


Boulesteix et al. (2012); Touw et al. (2013); Kruppa et al. (2014); 


Ziegler and Konig (2014). 


The standard approach to analyze survival outcomes with random forests is 
termed random survival forests (RSF; [Ishwaran et al. 2008). In RSF, an en¬ 
semble of survival trees is built, and tree splitting is performed by maximizing 
the log-rank statistic in each node. Ensemble predictions are given by averages 
over the cumulative hazard estimates in the terminal nodes of the trees, as esti¬ 
mated by the Nelson-Aalen estimator. The most common approach to evaluate 
the predictive performance of the ensemble is the calculation of the C statistic 


for survival data, also termed ‘Harrell’s C” (Harrell et al., 1982 Ishwaran et al. 


2008). A value of C = 0.5 corresponds to a non-informative prediction rule 


whereas (7=1 corresponds to perfect association, implying that Harrell’s C 
is an easy-to-interpret coefficient that accounts for the whole range of the ob¬ 


served survival times (Schmid and Potapov 2012). In biomedical applications 


in particular in the analysis of gene expression data, C often ranges between 
the values 0.6 and 0.75. For example, estimates in this range were reported, 


l^mong others, by Van Belle et al. (2011), Schroder et al. (2011) and Zhang 


et al. (2013). A remaining disadvantage of the RSF approach with (7-based 


evaluation, however, is that the split criterion used for tree building is differ¬ 
ent from the performance criterion used to measure prediction accuracy. As a 
result, the performance measure of interest, i.e. Harrell’s (7, may not be fully 
optimized by the log-rank splits and may even have characteristics that are not 
reflected by the log-rank statistic. 

In this work, we therefore investigate whether the performance of RSF can 
be improved if Harrell’s (7 is used for both node splitting and the evaluation of 
prediction accuracy. In other words, the idea is to replace the log-rank split 
criterion by Harrell’s (7 and to determine split points that are optimal with 
respect to Harrell’s C in each node. In Sectionj^we first provide a description of 
the random forest algorithm for survival data, which is followed by theoretical 
considerations on the two split criteria. In two simulation studies and with 
the re-analysis of two cancer data sets (Section we finally demonstrate that 
the use of Harrell’s C can lead to systematic improvements in the predictive 
performance of RSF. 
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2. Methods 


2.1. Random survival forests 

Algorithm provides a description of the RSF algorithm for n independent 
observations and p predictor variables. Before the algorithm starts, the num¬ 
ber of trees, termed ntree, of the RSF and the number of predictor variables 
mtry available for splitting at each node need to be defined. Recently, |Lopes 


(2015) derived the limiting distribution of the prediction error for dichotomous 


endpoints and showed how this finding may be used for determining optimized 


values of ntree. Kruppa et al. (2013) demonstrated how the hyper-parameter 


mtry can be optimized. 

An important feature of the RSF algorithm is the use of the log-rank statistic 
to split observations at each node and in every tree (Step 2 in Figure[^. The log- 
rank statistic will be formally introduced below. At a specific node, the variable 
and the split point that maximize the log-rank statistic over all possible split 
points and all mtry variables are used for splitting. With this approach, the 
dissimilarity of the survival curves in the two children nodes is maximized. An 
alternative criterion for node splitting in Step 2 is Harrell’s C, which will also 
be considered below. 

The performance of the random survival forest is evaluated using indepen¬ 
dent test data in Steps 3 and 4 of the algorithm. If no independent data are 
available, the out-of-bag data generated in Step 1 are used to evaluate the pre¬ 
dictive performance. It is important to note that several summary measures are 
available in Step 3 of the algorithm. For example, Kaplan-Meier or Nelson-Aalen 
estimates can be derived in each terminal node, and results may be averaged 
over all trees. In addition, confidence intervals can be obtained for these es¬ 


timators (Wager et al. 2014 Wager and Walther 2015 Mentch and Hooker 


2016). 


2.2. The C statistic and the log-rank statistic as split criteria 

In this section we introduce the use of Harrell’s C as split criterion, and we 
start with a theoretical analysis of both Harrell’s C and the log-rank statistic. 
Specifically, we show that both split criteria are special cases of the Gehan statis¬ 


tic (Gehan 1965) and that they can be obtained from the Gehan statistic by 


applying different standardization and weighting schemes. Since these schemes 
are different, differences can be expected between the two criteria regarding 
their splitting behavior in RSF. First, we introduce basic notation and provide 
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Step 0: Fix ntree and mtry. 

Step 1 : Draw ntree bootstrap samples with replacement from the learning 
data. 

Step 2: For each of the ntree bootstrap samples use the following random 
survival tree procedure: In each node, 

• draw mtry candidate predictor variables, 

• calculate the log-rank statistic for each possible split point of each of 
the mtry candidate predictor variables; select the predictor-split-point 
combination that maximizes the log-rank statistic; split the observa¬ 
tions into two children nodes, 

• continue tree growing as long as the number of non-censored obser¬ 
vations in each node is larger than a pre-specified minimum terminal 
node size, termed nodesize. 

Step 3: For a new observation, drop the observation down to the final nodes 
of a tree and do this for all ntree trees. Predictions can be summarized 
over all trees in several ways. The most common strategy is to apply 
the Nelson-Aalen estimator in each terminal node and to average the 
resulting estimates of the cumulative hazard function over all trees and 
time points to give a one-dimensional score. 

Step 4: For a set of test data, prediction accuracy is evaluated by calculat¬ 
ing the score described in Step 3 for each observation and by evaluating 
the C statistic. 


Figure 1: Schematic overview 


of the RSF algorithm proposed by [Ishwaran et al.| ||2008|l. 


formal definitions of the log-rank, C and Gehan statistics. Second, we analyze 
the connection between the measures. Third, we provide a description of how 
Harrell’s C should be used as a split criterion in random forests for survival 
data. 

Notation 

Throughout this paper we assume that RSF are fitted to a set of independent 
and identically distributed data of size n. The data are represented by vectors 
(Ti, Ai, Xii ,..., Xip), j = 1,..., n, where is a possibly right-censored contin¬ 
uous survival time and {Xu,... ,Xip)^ is a vector of predictor variables. It is 
assumed that Ti is the minimum of the true survival time Ti and an independent 
continuous censoring time Ci. The variable := l{Ti < Ci) indicates whether 
Ti has been fully observed (A^ = 1) or not (Aj = 0). To simplify notation, we 
assume that there are no tied observed survival times in the data. A predictor 
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variable Xj, j £ is called non-informative if the distribution of T 

conditional on Xj does not depend on Xj. Otherwise, Xj is called informative. 

The events are observed at K ordered time points < ... < t(^K) with 
K < n. The numbers of events and observations at risk at k = 1,... ,K, 
are denoted by dk and Yfc, respectively. 

As described in Step 3 of the RSF algorithm (Figure [^, the outcome of an 
RSF is calculated from the cumulative hazard estimates in the terminal nodes. 
A one-dimensional score 77 ^ £ K is estimated for each observation i = 1,..., n, by 
averaging the cumulative hazard estimates over all trees and all time points 


Definition of Harrell’s C 


Harrell’s C (Harrell et al. 19821 is given by 


C = 






( 1 ) 


where the indices i and j refer to pairs of observations in the sample. The 
C statistic is the number of concordant pairs of observations divided by the 
number of comparable pairs. Multiplication by the factor in Eq. ([^ discards 
pairs of observations that are not comparable because the smaller survival time 
is censored, i.e., Xj = 0 . 

Harrell’s C is designed to estimate the concordance probability 
> E)> which compares the rankings of two independent pairs of 
survival times Ti,Tj and predictions rji^rjj. The concordance probability evalu¬ 
ates whether large values of 77 ^ are associated with small values of and vice 
versa. Harrell’ C can also be interpreted as a summary measure of the area(s) 


[jmder the time-dependent ROC curves ( [Heagerty and Zheng[ |2005[ [Schmid and 
|Potapov] 2012). A value of C = 0.5 corresponds to a non-informative prediction 
rule, whereas C = 1 corresponds to perfect association. It has thus been argued 
that Harrell’s C is an easy-to-interpret coefficient that accounts for the whole 


range of the observed survival times. Subsequently, Ishwaran et al. (2008) pro¬ 


posed to use Harrell’s C for evaluating the predictive performance of an RSF 
model. 

In order to use Harrell’s C as a split criterion in RSF, it is necessary to 
define appropriate values of the score 77 . For this purpose, we assume that the 
observations in a node under consideration are split into two disjoint children 
nodes Qo and Qi according to the split point (or “threshold”) of some candi- 


6 


















date variable. Hence, to evaluate the goodness of the split using Harrell’s C, 
we define 7 ^ := l{i G Gi) G { 0 , 1 } and estimate the concordance probability 
P(7i < 7i I P* > Tj) = P(i e GoJ e t/i I Ti > Tj) by 


0-5 ■ m > Tj) ■ l{i G Go,3 e go) ■ A, 

0.5 • 1(1} > E) • I(* gGujG Gi) ■ A, 

-|- -:3-- . (2) 

The definition of ([^ implies that a value of 0.5 is assigned to pairs of observations 
belonging to the same child node. 


The log-rank statistic 

The log-rank statistic is defined by 


Plog—rank — 


~ Aife ■ dk/Yk))'^ 

Ek YikYok ■ dk{Yk - dk)l[Y^{Yk - 1 )] 


(3) 


where dok,dik and Y^kjYik refer to the numbers of events and observations at 
risk in groups and Gi , respectively. It is a popular split criterion in survival 


trees (LeBlanc and Crowley 1993) and has been adopted by Ishwaran et al. 


(2008) for use in RSF. 


The Gehan statistic 


The Gehan statistic (Gehan, 1965) is given by 


u= Y. - E ■ (4) 

ieQojeQi ieQo.jeQi 


Assigning the value 1 to pairs with Ti > Tj and the value —1 to pairs with 
Ti < Tj, the Gehan statistic evaluates whether survival times in Go are system¬ 
atically larger than those in ^ 1 . Pairs of observations are only considered if the 
smaller survival time is not censored. 
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Relationship between the Gehan statistic and the log-rank statistic 

Standardization and squaring of the Gehan statistic U yields the Gehan- 
Wilcoxon statistic, which can be written as 


^Gehan ■ — 


^ iEkYk-{d,k-Y,k-dk/Y,)f 

var(G) Efc Y^ ■ Y^kYok ■ dk{Yk - dk)/[Y^{Yk - 1)] 


(5) 


(Gehan 1965 Tarone and Ware] 19771. Eq. ([^ shows that the Gehan-Wilcoxon 
statistic is a weighted version of the log-rank statistic Tiog_rank- Specifically, 
Tbehan Weights the summands in the numerator and denominator of Tjog-rank 
by the number of observations at risk, thereby assigning higher weight to events 
at early time points. This relationship gives rise to the more general class of 


Tarone-Ware test statistics (Tarone and Ware 1977). 


Relationship between the Gehan statistic and Harrell’s C 

The Gehan statistic of Eq. ([^ can equivalently be written as 


u = Y. Y 

ieGoJeGi leQojeQi 

= 2- Y m>f,)-A,-N, 


( 6 ) 


where N is the number of possible comparisons between Qq and Qi. Similarly, 
the numerator of Harrell’s C in (§ can be written as 

Y KT, > fj) ■ Aj + 0.5 • No + 0.5 • Ni , (7) 

jedo.ieei 


where No and Ni refer to the number of comparable pairs in groups Qo and Qi, 
respectively. Eqns. (j^ and (j^ both depend on the term jgGi ^ 

Tj) ■ Aj. As a consequence, Harrell’s C is linearly related to the Gehan statistic. 


Gonclusion about the relationships 
The above considerations show 

(i) that both Harrell’s C and log-rank statistic are based on the same goodness- 
of-split criterion, namely the Gehan statistic, 

(ii) that different standardization and weighting schemes are applied to trans¬ 
form the Gehan statistic into Harrell’s C and the log-rank statistic. 













(iii) that both measures include additional terms which become large if the 
groups Qo and Qi are unbalanced. For Tiog-rank this is seen by consider¬ 
ing the denominator in Eq. which depends on the products Yik ■ Fbfc. 
The latter term becomes small if Qq and Qi are unbalanced, effectively 
increasing Tcehan- It also increases Tiog_rank, whose denominator depends 
on the same products. In a similar way, Eq. 0 shows that Harrell’s C 
depends on the sum iVo -I- fVi, which becomes large if Qo and Qi are un¬ 
balanced. 

The effect of these weighting schemes on RSF performance is investigated in 
Section |3l 


2.3. Use of Harrell’s C as split criterion in RSF 

The definition of the binary values 7 ^ = l{i G Qi) allows for the following 
strategy to use Harrell’s C as split criterion in RSF: In Step 2 of the RSF 
algorithm (Figure [^, the C statistic is evaluated in each node for all possible 
split points of all mtry candidate variables. In case C < 0.5, node labels are 
switched and Harrell’s C is replaced by 1 — C. Based on the obtained set of 
C values, the predictor-split-point combination maximizing Harrell’s C is used 
to split the observations into the children nodes Qo and Qi. To account for 
the fact that Harrell’s C discards pairs of observations with censored smaller 
observed time, which is known to result in an upward-bias in the estimation of 


the concordance probability P(i G Qo,j G Qi\Ti > Tj) (Schmid and Potapov 


20121, one could additionally use the estimator proposed by Uno et al. (20111 


(‘Uno’s C’) for node splitting. Uno’s C ensures consistent estimation of the 
concordance probability by inverse probability-of-censoring (IPC) weighting of 
the terms in Here, we do not propose to replace Harrell’s C by Uno’s C, as 
this strategy requires the additional estimation of the IPC weights in each node. 
Apart from the computational cost, this estimation step is not feasible in deeply 
grown trees with small minimum node size. Furthermore, we demonstrate in 
simulation study 2 that the conclusions are not altered when Harrell’s C is used 
instead of Uno’s C. 

RSF with log-rank-based and (Harreirs-)C'-based splitting are implemented 
in the R package ranger (Wright and Ziegler, 2016 options splitrule = "C" 
and splitrule = "logrank"). Default values of ntree and mtry are 500 
and ^Jp, respectively, where p is the number of predictor variables. 
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2.4- Simulation study 1 - threshold selection 

We compared the performance of the two split criteria by carrying out two 
simulation studies. In simulation study 1, we investigated the behavior of the 
log-rank and C statistics with regard to threshold selection. The purpose of this 
study was to show that the thresholds of a given predictor variable that maxi¬ 
mize the two statistics systematically differ. Specifically, the log-rank statistic 
has a stronger tendency to generate unbalanced children nodes than Harrell’s C. 
In simulation study 2, we investigated the use of Harrell’s C and the log-rank 
statistic as split criteria in RSF and analyzed the prediction accuracy of the 
resulting model fits. 

To study differences in the threshold selection between the two split crite¬ 
ria, we first considered a single continuous predictor variable x that followed 
a uniform distribution on [—3,3] (simulation study l.a)). The sample size in 
simulation study l.a) was set to n = 1000 independent observations. Survival 
times were generated from a standard exponential distribution without taking 
the values of x into account. Consequently, simulation study l.a) reflected the 
situation where a non-informative predictor is considered for node splitting. 
Note that the use of a uniformly distributed predictor variable x corresponds 
to a very general scenario because every continuous predictor variable can be 
transformed into a uniformly distributed continuous variable by use of its distri¬ 
bution function. In particular, this transformation does not affect the splitting 
behavior of the two split criteria Harrell’s C and log-rank statistic. Censoring 
times were generated from an exponential distribution whose rate was adjusted 
such that 50% of the observed survival times were right-censored. 

In simulation study l.b) we set the sample size to n = 100 and investigated 
the tendency of the two split criteria to generate unbalanced children nodes in 
settings with an informative predictor variable. Specifically, we investigated the 
threshold models T = exp(I(a; > 0.25) -be) and T = exp(I(a; > 0.75) -be), with x 
and e following independent standard uniform and normal distributions, respec¬ 
tively. For both threshold models, three censoring rates (25%, 50% and 75%) 
were considered. Censoring times were generated in the same way as before. In 
both parts of simulation study 1, 1000 replications were used. 

2.5. Simulation study 2 - predictive performance of RSF 

To study the effects of C-based and log-rank-based splitting on the perfor¬ 
mance of RSF, we based the data-generating model on four informative predictor 
variables xi,... ,X 4 that followed a multivariate standard normal distribution 
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with pairwise correlation p = 0.5. After random number generation, the values 
of Xi,..., a ;4 were rounded to multiples of 0.1. To investigate the effect of many 
noise variables, we also considered models with p = 10, p = 505, and p = 1005 
predictor variables. This was done by adding non-informative variables to the 
data sets that followed the same distribution as Xi,... ,X 4 . Two sample sizes 
were considered in this simulation study (n = 100, 300). 

To generate the observed survival times, we first dropped the values of the 
predictor variables down a probability tree model with fixed split rules and 
thresholds (Figure [^. This procedure was equivalent to the application of a 
non-random function of the predictor variables and resulted in a set of one¬ 
dimensional location parameters Xi S [0,1], i = l,...,n. Next, the Xi values 
were incorporated into the Weibull model In(Ti) = Xi + a ■ Si from which the 
survival times Ti were generated. The noise variable was chosen independently 
of Xi and followed a standard extreme value distribution. The parameter a 
quantified the amount of noise added to the location parameter Xi. It was 
adjusted such that the empirical standard deviation of 7 ^ was equal to a. This 
combination of Xi and the Weibull model resulted in a data-generating process 
that was a combination of a tree model and a proportional hazards model. 
We considered three censoring rates (25%, 50%, 75%). Censoring times were 
generated as before (simulation study 1 ). 

In simulation study 2.a), the true values of the predictor variables were 
used for RSF estimation. In simulation study 2.b), we dichotomized the pre¬ 
dictor variables I(xj > 0 ), j = 1 ,... ,p, and investigated the behavior of log- 
rank and C-based splitting in the presence of categorical predictors. The tree 
model defined in Figure]^ was again used generate the set of values Xi € [0,1], 
i = 1 ,..., n. 

RSF were estimated using the ramger package with ntree = 500 and min¬ 
imum terminal nodesize = 3. The other RSF parameters were chosen by 
default values. Prediction accuracy was measured by Harrell’s C. In simulation 
study 2.a) we also investigated the performance of RSF when Uno’s C was used 
for evaluation. Both measures were evaluated on 500 independent test data sets 
of size n = 1000 each. The censoring rates in the test data sets were the same 
as those in the corresponding learning data sets. 
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Figure 2: Probability tree model to generate the location parameters Aj in simulation study 2. 
The numbers in the terminal nodes refer to the values of A^ that were obtained by dropping 
the values of the predictor variables xi,... ,^4 down the tree. 


2.6. Data analysis 1 - gene expression data to predict the time to distant metas- 
tases in breast cancer 

In the first data analysis, we applied RSF to gene expression data for metas¬ 
tasis-free survival in patients with node-negative breast cancer. Data were orig¬ 


inally collected by Desmedt et al. (2007) to evaluate a 76-gene expression signa¬ 


ture derived from Affymetrix microarrays (Wang et al. 2005) in 198 patients. 


The outcome variable was the time from diagnosis to distant metastases. In 
addition to the expression levels of the 76 genes we considered the five clin¬ 
ical variables estrogen receptor status (positive/negative), tumor size (mm), 
tumor grade (poor/indermediate/good differentiation), age (years) and center 


(five hospitals, see Desmedt et al. 2007 for details). The data are publicly 
available as part of the Gene Expression Omnibus (GEO) database (accession 
number GSE7390). Two observations with missing information on tumor grade 
were dropped from statistical analysis. Following the strategy by [Desmedt et al(] 


(2007), observed survival times were censored at 10 years. Observed metastasis- 


free survival times then ranged from 125 days to 3652 days, with 79.08% of the 
survival times being censored. 

In the first step, we considered the five clinical variables only and fitted 
RSF to subsamples of sizes | (n = 64), f (n = 97) and | (n = 130) of the 
196 observations. For each of the three sample sizes, 100 random subsamples 
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were generated. RSF predictions were evaluated on 100 subsamples of size 66 
that were not part of the learning data sets. The same evaluation data could 
thus be used for all three sizes of the learning data. Both Harrell’s C and the 
log-rank statistic were used for node splitting. The number of trees was set to 
ntree= 500. For all tuning parameters, we used default values. 

In the second step, we fitted RSF to the same 100 subsamples, this time 
considering both the clinical variables and the expression values of the 76 genes. 
Because all 76 gene expression variables were identified as being predictive for 


the time to distant metastases (Wang et al., 20051, it could be assumed that all 


learning data sets were free of noise variables. 

In the third step, we repeated the resampling analysis, this time adding var¬ 
ious numbers of non-informative variables (padd = 1000,1500, 2000,..., 5000) 
to the clinical variables and the gene expression variables. All non-informative 
variables were generated from a multivariate standard normal distribution with 
pairwise correlation p — 0.5. This last setting allows to compare the RSF results 
with the findings from the high-dimensional simulation scenario in simulation 
study 2.b). 


2.1. Data analysis 2 - survival in patients with diffuse large B-cell lymphoma 
In the second data analysis, we analyzed survival in patients with diffuse 


large B-cell lymphoma (Rosenwald et al., 20021 using RSF. The aim of our 


analysis was to investigate the performance of the two split criteria in a higher¬ 
dimensional scenario where the signal-to-noise ratio of the predictors was un¬ 


known. Data were originally collected by Rosenwald et al. (20021 and comprised 


p = 7399 gene expression values from Lymphochip DNA microarrays. Here, we 
[ponsider a subsample of n = 240 patients that was analyzed previously by|Binde7| 


and Schumacher (2008). The outcome was overall survival after chemotherapy. 


The median follow-up time was 2.8 years; 42.5% of the survival times were cen¬ 
sored. Because some of the microarray features contained missing values, we 


used the imputation method of Schumacher et al. (20071. 


We applied C-based and log-rank-based splitting as before, and RSF analysis 
was based on the same resampling approach as above. Specifically, we fitted RSF 
models to 100 subsamples of the original data using sample sizes of | (n = 80), 
i (n = 120) and | (n = 160) of the 240 observations. RSF predictions were 
evaluated on 100 subsamples of size 80 that were not part of the respective 
learning data sets. 
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Threshold 


Figure 3: Kernel density plots of the thresholds that were selected by the C and log-rank 
statistics in simulation study l.a). The thresholds refer to a non-informative predictor vari¬ 
able X that was uniformly distributed on [—3,3]. The kernel density plots therefore imply 
that Harrell’s C resulted in more balanced groupings of the observations than the log-rank 
statistic. 


3. Results 

3.1. Simulation study 1 - threshold selection 

Figure shows that the selected thresholds differed substantially for the 
two split criteria when a uniformly distributed predictor variable with no ex¬ 
planatory value was used. Specifically, the kernel density plot of the thresholds 
selected by Harrell’s C reaches its maximum near the midpoint of the inter¬ 
val [—3,3], whereas the corresponding density plot for the thresholds selected 
by the log-rank statistic is characterized by two distinct peaks near the bound¬ 
ary values —3 and 3. This implies that Harrell’s C resulted in more balanced 
groupings of the observations than the log-rank statistic. 

Simulation study l.b) confirms the observation that Harrell’s C resulted in 
more balanced children nodes than the log-rank statistic, as both the median 
threshold values and the whole empirical threshold distributions obtained from 
the C statistic were throughout closer to 0.5 than those obtained from the log- 
rank statistic (Figure 1^. 

3.2. Simulation study 2 - •predictive performance of RSF 

The panels in the left column of Figure]^ (simulation study 2.a)) display 
the differences in RSF prediction accuracy between C-based and log-rank-based 
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a) True threshold = 0.25 



C log-rank C log-rank C log-rank 


Split criterion 


b) True threshold = 0.75 



C log-rank C log-rank C log-rank 


Split criterion 


Figure 4: Boxplots of the thresholds optimizing the C and log-rank statistics in simulation 
study l.b). The true thresholds were 0.25 and 0.75. On average, threshold values from the C 
statistic were closer to 0.5 when compared with the log-rank statistic. Thus, C-based splitting 
resulted in more balanced children nodes than log-rank splitting. 


splitting, as measured by Harrell’s C. The panels in the right column show 
the differences in RSF prediction accuracy when Uno’s C was used to evaluate 
predictive performance instead of Harrell’s C. Results obtained from Harrel’s C 
and Uno’s C were very similar throughout. 

All panels in Figure show systematic differences between C-based and log- 
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rank-based splitting. In most scenarios, C-based splitting improved the perfor¬ 
mance of RSF, which is indicated by an upward shift of the boxplots in Figure 
and by median values lying above the horizontal line at zero. The magnitude of 
the observed differences depended on the censoring rate, with higher censoring 
rates tending to result in an improved performance of C-based splitting rela¬ 
tive to log-rank-based splitting (see in particular Figure [^a)). Increasing the 
number of non-informative variables tended to improve the performance of the 
log-rank split criterion relative to Harrell’s C. For example, the performance of 
log-rank based RSF was better for 1001 non-informative predictors (Figurej^c)) 
than it was for six non-informative predictors (Figurej^a)). In the latter figure, 
most of the boxplots are characterized by a distinct upward shift, whereas most 
of the medians presented in the former figure are only slightly larger than zero. 

Figure|^shows the differences in RSF prediction accuracy that were obtained 
from the dichotomized predictor variables l{Xj > 0), j = 1,... ,p (simulation 
study 2.b)). Apart from the low-dimensional scenario with p = 10, where log- 
rank splitting outperformed C-based splitting, no systematic differences were 
observed between C-based and log-rank-based splitting. In fact, most of the 
boxplots in Figure are either close to zero or are centered around the hori¬ 
zontal line at zero. While substantial differences were found for the split points 
of C-based and log-rank splitting in the simulation studies with continuous 
predictor variables (see in particular Figure and also the upward shifts in 
Figure [^a)), split rules became very similar when predictor variables were di¬ 
chotomized (Figure [^. 

3.3. Data analysis 1 - gene expression data to predict the time to distant metas- 
tases in breast cancer 

Only a small increase in RSF prediction accuracy was observed when log- 
rank splitting was replaced by C-based splitting and when the five clinical vari¬ 
ables were used (see Figure lila), where all boxplots are close to zero.). This 
finding agrees well with the results of simulation study 2.b), as three of the five 
clinical variables were categorical (cf. Figure]^. In contrast, when the contin¬ 
uous gene expressions were added to the five clinical variables (Figure [^b)), 
prediction accuracy was substantially higher for RSF using C-based splitting 
compared to RSF using log-rank splitting. As seen from Figure [^b), even the 
lower quartiles of the differences between C-based and log-rank splitting were 
distinctly larger than zero. This result was observed for all learning sample 
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Figure 5: Results of simulation study 2.a). The panels show boxplots of the performance 
differences between C-based and log-rank-based splitting in RSF, as measured by Harrell’s C 
(left column) and Uno’s C (right column) using independent test data. C-based splitting 
performed better than log-rank-based splitting if the difference was > 0. 


sizes. In absolute value, median C estimates obtained from C-based splitting 
were 0.61/0.61/0.60 in the five-variable clinical model and 0.67/0.71/0.73 in the 
combined model using both clinical and gene expression data (learning sample 
sizes n = 64/97/130, respectively). The largest improvement obtained from 
C-based splitting was seen in the combined model with n = 130. In this case. 


17 







































































a) p = 10 


c)p=1005 



b) p = 505 



Censoring rate 


ill 

T T -L 

F ^ t 


' 

1 ; 

n = 100 

n = 300 


25% 50% 75% 25% 50% 75% 

Censoring rate 


Figure 6: Results of simulation study 2.b). The panels show boxplots of the performance 
differences between C-based and log-rank-based splitting in RSF, as measured by Harrell’s C 
using independent test data. Predictor variables were dichotomized, as described in Sec¬ 
tion |3.2| C-based splitting performed better than log-rank-based splitting if the difference 
was > 0. 
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Figure 7: Boxplots of the performance differences between C-based and log-rank-based split¬ 
ting in RSF for the breast cancer data ( [Desmedt et al.||2007 l. For all learning sample sizes, 
predictive performance was measured by calculating Harrell’s C from test data sets of size 
n = 66 each, a) Data analysis using the five clinical variables only, b) low-dimensional data 
analysis using clinical and gene expression data. 


the median C estimate obtained from log-rank splitting was 0.685, and the 
median improvement obtained from C-based splitting was 0.054 (95% confi¬ 
dence interval [0.039,0.063]), corresponding to a median performance gain of 
approximately 8%. With increasing numbers of noise variables (Figure]^, the 
performance of C-based splitting decreased relative to log-rank-based splitting. 

In addition to the evaluation of RSF prediction accuracy, we also calculated 
permutation-based variable importance scores from the complete data set with 
n = 196 observations. Figure[^shows that C-based and log-rank-based splitting 
yielded substantial differences in the permutation importance and the order of 
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Figure 8: Boxplots of the performance differences between C-based and log-rank-based split¬ 
ting in RSF for the high-dimensional analysis of the breast cancer data jPesmedt et al.[ |2007| l. 
Predictive performance was measured by calculating Harrell’s C from test data sets of size 
n = 66 each. Non-informative variables (padd = 1000,1500, 2000,..., 5000) were added to the 
clinical variables and the gene expression variables. 


the importance of the predictor variables. For example, although both splitting 
rules assigned the highest importance score to gene 202240_at, only 19 of the 30 
most important ‘log-rank’ predictors were also contained in the respective set of 
‘C-based’ predictors. Importantly, estrogen receptor status was the 10*^ most 
important variable in the set of C-based predictors, while it was only ranked 
^28 in the list of the most important predictor variables when log-rank splitting 
was used. Hospital location was among the most important predictor variables 
in both approaches, suggesting that the survival times showed considerable ge¬ 
ographic variation. 

Note that the main purpose of Figure is to demonstrate that the re¬ 
sults of C-based splitting have a meaningful interpretation in regard to the 
inclusion of clinically relevant predictor variables. Additional simulation stud¬ 
ies would be necessary to systematically investigate the relationship between 
permutation-based variable importance and C-based splitting. Also, there are 


-various alternatives to permutation-based variable importance (e.g., Ishwaran 


et al. 2011), which might be used for additional variable selection purposes in 


high-dimensional settings. Investigating the behavior of these measures in RSF 
with C-based splitting may constitute an issue of future research. 
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Figure 9: Scaled permutation-based importance scores for the 30 most important clinical or 
gene expression variables in the RSF analysis of the breast cancer data ( [Desmedt et al.[|2007[ |. 
a) C-based splitting, b) log-rank splitting. 

3.4- Data analysis 2 - survival in patients with diffuse large B-cell lymphoma 
Figure [T0| shows the main results of the RSF analysis for the diffuse large 
B-cell lymphoma gene expression data. A substantial gain was observed in RSF 
performance when log-rank splitting was replaced by C-based splitting, as all 
boxplots in Figure show a notable upward shift. This performance gain 
was observed for all sizes of the learning data. In absolute value, median C 
estimates obtained from C-based splitting were 0.60/0.61/0.62 (learning sam¬ 
ple sizes n = 80/120/160, respectively). The largest improvement obtained 


20 
































































o 

o 


n = 80 (33%) n = 120 (50%) n = 160 (66%) 
Size of learning sample 


Figure 10; Boxplots of the performance differences between C-based and log-rank-based split¬ 
ting in RSF for the high-dimensional diffuse large B-cell lymphoma data l |Rosenwald et ah] 
|2002[ |. Predictive performance was measured by calculating Harrell’s C from evaluation data 
sets of size n = 80 each. 


from C-based splitting was seen in the model with n = 160. In this case, the 
median C estimate obtained from log-rank splitting was 0.584, and the me¬ 
dian improvement obtained from C-based splitting was 0.028 (95% confidence 
interval [0.018,0.041]), corresponding to a median performance gain of approx¬ 
imately 5%. 


4. Discussion 

Random forests are an established machine learning technique in medical 


research ( 

Foster et al. 

2011 

Bacauskiene et al. 

2012 

Englund and Verikas 

2012 

Casanova et al. 

2014 

Slalderoni et al. 2011 

)). In survival analysis, RSF 


are not only a powerful tool for risk prediction but are also a nonparametric 
alternative to Cox regression in situations where the proportional hazards as¬ 


sumption is violated (Omurlu et al. 2009) or when the functional form of the 


predictor effects on survival is misspecified. 

In this work, we have proposed Harrell’s C as an alternative split criterion 
to log-rank splitting for RSF. Conceptually, this approach corresponds to a uni¬ 
fied analysis strategy, in which the split criterion is identical to the evaluation 
criterion of interest. Similar strategies have previously been developed for other 


statistical learning methods. For example, Mayr and Schmid (2014) proposed 


a gradient boosting algorithm for direct optimization of the concordance prob¬ 
ability, implying that the same performance criterion was used for both model 
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building and evaluation. In a similar way, several authors proposed support 
vector machines for censored data (Van Belle et al. 2011 Polsterl et al., 2015). 

Based on our analytical and numerical results, we identihed three situations 
in which RSF performance improved when Harrell’s C was used for node split¬ 
ting instead of the log-rank statistic. First, C-based splitting performed better 
than log-rank splitting in scenarios where the signal-to-noise ratio was high in 
the data. This finding can be explained by the more unbalanced children nodes 
obtained from log-rank splitting. In the CART literature, a tendency towards 
unbalanced splits is referred to as ‘end-cut preference’ (ECP) and has been a 


subject of debate for long (Breiman et al. 1984). In recent work, Ishwaran 


(2015) provided an in-depth analysis of the ECP phenomenon and discussed its 
consequences for the performance of random forests. In particular, Ishwaran 
showed that split rules with ECP are desirable for trees with a small minimum 
node size when a non-informative predictor variable is considered for splitting. 
This is because end-cut splits conserve the sample size and therefore allow trees 
to ‘recover’ from bad splits resulting from the splitting of a non-informative 
predictor variable. In the light of these findings, it is expected that split cri¬ 
teria with ECP increase the performance of RSF when applied to noisy data. 
Log-rank splitting, which tends to produce more unbalanced splits than Har¬ 
rell’s C, is therefore expected to result in an improved RSF performance when 
the percentage of non-informative predictors is high. This explains the good 
performance of the log-rank statistic under the noisy scenarios compared to 
Harrell’s C. 

Second, C-based splitting resulted in substantial performance gains when 
the number of informative continuous predictor variables was large compared 
to the number of categorical predictor variables. Conversely, C-based and log- 
rank-based splitting resulted in little difference in RSF performance when all 
predictor variables were categorical. This result is explained by the fact that 
differences in threshold selection become most apparent in the presence of con¬ 
tinuous predictor variables. 

Third, we observed that C-based splitting tended to outperform log-rank 
splitting in data sets with a high censoring rate. This result confirmed earlier 
findings by [Ishwaran et ^ (2008) who stated that ‘the performance of [log- 
rank-based] RF regression depended strongly on the censoring rate’, with the 
prediction accuracy of RSF being ‘poor’ in high-censoring scenarios. 

In summary, our results show that Harrell’s C as split criterion improves the 
performance of RSF in a wide range of settings. We recommend the use of C- 
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based splitting in smaller scale clinical studies and the use of log-rank splitting 
in large-scale ‘omics’ studies. 
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